1  Introduction

This project consists of the development of a software application for the user-guided design of
a robotic system in conjunction with a computer-automated optimization system (Fig. 1). In
Year 1 (Feb 2011 - Feb 2012) we demonstrated an initial system that allows a human user to
configure a robot and its environment and to specify a task. We also demonstrated a paradigm in
which a human user guides an automated optimization / machine learning system, so that the two
collaborate in solving a problem. In Year 2 (Feb 2012 - Feb 2013) we integrated these two parts
into a prototype desktop application and concluded the year by testing the effectiveness of this
approach. In Year 3 we extended our system so that it can support the crowdsourcing of robotics
by non-expert users: multiple users collectively influence an optimization method (Fig. 2).

This final report is organized as follows. Section 2 is an overview of this project and
summarizes its goals. Section 3 describes the successful crowdsourcing of robotics: we have shown
that, when two non-experts interact with our robotics system, robot controllers are developed more
rapidly than when two users act independently or one user works alone.

The main new deliverables are two submitted manuscripts that document the design,
deployment and results of crowdsourcing robotics. Work described in the first manuscript was
summarized in the previous QPR. Work described in the second manuscript is summarized in
this report.

Wagy, M., Hornby, G. S. & Bongard, J. C. (2014). Crowdsourced robot design aided by
evolutionary computation. The Fourteenth International Conference on the Synthesis and Simulation
of Living Systems (ALife XIV). In review.

Bernatskiy, A., Hornby, G. S. & Bongard, J. C. (2014). Improving Robot Behavior Optimization
by Combining User Preferences. The Fourteenth International Conference on the Synthesis
and Simulation of Living Systems (ALife XIV). In review.
Figure 1: An overview of the software tool for human-computer collaborative design of robotic
systems, which is the goal of this project. The user first specifies the robot and the controller task
(1(a)) and then assists the computer in creating a controller by indicating preferred designs (1(b)).
Behind the scenes, the software builds a user model and uses it to help drive its
search algorithm (1(c)).


2  Project  Overview 


An overview of how the proposed system works is as follows. In this project we use advanced
Interactive Evolutionary Algorithms with User Modeling as the core technologies in developing an
application that enables human users to interactively guide the automated design of sophisticated
robotic systems for mobility/manipulation tasks. How this application works can be understood by
following the three stages of its workflow:


1.  Specify the robot morphology.

2.  Specify the control task (Figure 1(a)).

3.  Interactively guide the design of a controller to accomplish the task (Figure 1(b)).


First (1) the user will either select a pre-existing robot morphology or create a new robot
using a GUI to specify its morphology. Next (2) the task environment is specified, either by loading
a pre-existing one or by specifying it with a GUI. Once the task environment is set, the control task
to be generated is specified by indicating the desired starting and stopping states of the robot and
objects in the environment (Figure 1(a)). Once the robot, environment and control task have been
specified, an optimization algorithm will begin searching for a controller to perform the task. The
user will be presented with example controllers generated by the search algorithm and (3) will be
able to indicate to the computer which results are more promising (Figure 1(b)). Based on the
user’s preferences, a model of the user will be continuously learned and this model will be used as a
kind of fitness function to help guide the optimization algorithm (Figure 1(c)).

In  addition,  once  a  robot  is  designed  and  its  control  policy  becomes  mired  in  a  local  optimum, 
a  user  may  follow  one  of  two  paths  to  alter  the  search  landscape:  they  may  alter  the  robot’s  task 


Figure  2:  Extending  the  system  from  individual  user  modeling  to  crowdsourcing.  The  core  of  the  system  is 
comprised  of  an  evolutionary  algorithm  that  optimizes  artificial  neural  network  control  policies  (a).  Each  control 
policy  is  evaluated  on  a  simulated  robot  (green;  b).  In  this  example,  control  policies  are  evolved  to  guide  the  robot 
around  a  barrier  (cyan)  to  reach  a  target  object  (red).  However,  search  converges  at  a  local  optimum,  which  corresponds 
to  the  robot  reaching  the  barrier  but  going  no  further  (arrow).  To  date,  we  have  developed  a  user  model  (d),  which 
collects  data  from  the  user  about  which  behaviors  she  prefers  over  others.  The  model  can  then  output  predictions  about 
how  much  a  user  will  like  an  unseen  behavior  (‘score’).  In  this  example  the  user  has  indicated  increasing  preference 
for  behaviors  that  guide  the  robot  to  the  right  edge  of  the  barrier  (dotted  lines  in  e).  The  optimizer  (c)  now  selects 
for control policies that increase fitness and increase scores output by the user model. In this past quarter we have
extended the code to accommodate multiple users (f-k). In this example the system collects preferences from two users
(j,k)  and  creates  three  user  models:  one  trained  on  preferences  from  just  user  1  (g),  just  user  2  (h),  and  both  users  (i), 
respectively.  Assuming  both  users  prefer  the  same  kinds  of  behaviors,  the  user  model  trained  on  the  combined  training 
set  from  both  users  (i)  will  have  lower  error  than  the  individual  user  models  (g,h)  and  will  be  used  for  influencing 
behavior  optimization.  If  two  users  provide  diverging  preferences  however  (p,q),  the  combined  user  model  (o)  will 
have a higher error than the two individual user models (m,n). Thus optimization (l) will select control policies that
maximize  fitness  and  obtain  high  scores  from  either  of  the  two  individual  user  models  (i.e.,  at  least  one  of  the  users 
will  like  the  behavior  produced  by  this  control  policy). 
environment, or they may indicate user preferences. As an example of the former approach we
have  evolved  behaviors  for  a  brachiating  robot  that  swings  underneath  suspended  rungs.  There  are 
local  optima  however  in  which  the  robot  manages  to  swing  up  between  rungs  and  ‘walk’  across 
their  tops.  The  user  can  remove  these  local  optima  from  the  search  space  by  placing  planks  just 
above  the  rungs  so  that  the  robot  hits  the  planks  when  trying  to  swing  up  between  the  rungs.  As  an 
example  of  the  latter  approach,  the  user  can  now  indicate  preferences  to  guide  search  away  from 
these  degenerate  solutions:  The  user  is  repeatedly  shown  two  robots  (controlled  by  two  competing 
control  policies);  one  gets  between  and  above  the  rungs  and  ‘walks’  slightly  further  than  the  second 
robot  which  swings  under  the  rungs  but  falls  off  earlier.  If  the  user  repeatedly  indicates  preference 
for  this  latter  behavior,  search  will  focus  on  improving  control  policies  that  keep  the  robot  below 
the  rungs.  Thus  the  amount  of  interaction  the  user  is  willing  to  provide  is  tunable:  the  user  may 
render  the  process  more  manual  by  defining  many  intermediate  states  close  to  one  another  in  the 
control  and  behavioral  space  of  the  robot,  or  more  automatic  by  defining  few  states  and  placing 
many  aspects  of  the  robot’s  controller  and  morphology  under  evolutionary  control. 


3  Crowdsourcing  Robotics 

Recently  it  has  been  demonstrated  that  collaboration  between  automated  algorithms  and  human 
users  can  be  especially  effective  in  robot  behavior  optimization  tasks.  In  particular,  we  recently 
introduced  a  Fitness-based  Search  with  Preference-based  Policy  Learning  (FS-PPL)  approach,  in 
which  the  algorithm  models  the  user  based  on  her  preferences  and  then  uses  the  model,  along  with 
the  fitness  function,  to  guide  search.  However,  so  far  only  interaction  between  a  single  human 
user  and  an  evolutionary  algorithm  was  considered.  If  multiple  users  contribute  preferences,  the 
algorithm  must  determine  whether  to  model  them  separately  or  jointly.  Here,  we  describe  an 
algorithm  in  which  one  evolutionary  algorithm  interacts  with  two  users  and  determines  the  best  way 
to  model  them  automatically.  We  test  the  algorithm  with  automated  substitutes  for  human  users 
and  show  that  it  performs  better  for  two  users  working  together  than  for  the  same  users  working 
separately, thus demonstrating the potential for crowdsourcing robot behavior optimization.¹

3.1  Introduction 

Historically,  interactive  evolutionary  algorithms  are  typically  used  to  solve  search  problems  in 
which  automatic  evaluation  of  a  solution  candidate  is  impractical  for  some  reason  -  for  example, 
artistic tasks. In this case the duty of solution evaluation is fully entrusted to the user. This
approach has been studied extensively and many successful algorithms have been designed (for
example (@)), many of which allow multiple users to collaborate on the same problem ([IT,
EE210D). Much less is known, however, about algorithms which distribute the burden of
evaluating solution candidates between the users and the computer.

In  this  work  we  employ  this  latter  approach  to  address  an  important  issue  arising  in  traditional 
fitness-based  evolutionary  algorithms  -  namely,  the  phenomenon  of  premature  convergence,  i.e. 
convergence to a local optimum with a large basin of attraction rather than to the global optimum
with a much narrower basin.² One approach used to combat this problem is to use multiple
objectives instead of just a single fitness value to evaluate solutions. Some objectives shown to
be effective are age (|[9fl ) and novelty (0). However, depending on the task, even multiobjective
algorithms can become trapped in local optima.

¹ The following is adapted from A. Bernatskiy, G. S. Hornby, J. C. Bongard (2014). Improving robot behavior
optimization by combining user preferences. 14th Intl. Conf. on the Synthesis and Simulation of Living Systems. In preparation.

For some tasks this problem can be greatly reduced by adding human preference as an
optimization objective. This is particularly true for robot behavior optimization, because humans have
good  intuition  about  legged  locomotion  and  are  able  to  visually  determine  that  search  has  become 
trapped  on  a  local  optimum  (0).  The  major  problem  with  these  methods,  however,  is  the  quantity 
of  preferences  required  from  the  user,  which  is  often  so  demanding  that  it  makes  the  algorithm  too 
labor-intensive  to  be  practical. 

This problem can be approached in several ways. One way is to use a machine learning
algorithm to build a model of the user and then use the model to supply preferences on the human user’s
behalf  as  behavior  optimization  continues  (lfT3lfT0,  HI  [3lD.  In  (fl3j)  we  investigated  the  efficiency 
of  this  approach  in  a  robot  behavior  optimization  task  with  a  deceptive  fitness  landscape.  Using  an 
algorithm based on Age-Fitness Pareto Optimization (AFPO) (0) with an additional user
preference objective and a neural network-based user model, we showed that a user model and fitness
function  together  can  guide  the  search  to  convergence  more  rapidly  (in  terms  of  wall-clock  time) 
than  either  of  them  on  its  own. 

Another  way  to  cope  with  the  labor  intensity  of  interactive  evolution  is  to  utilize  evaluations 
coming  from  multiple  users.  This  approach  has  been  investigated  theoretically  to  some  extent 
(021)  and  successfully  applied  to  artistic  tasks  ([  LFE!). 

Our hypothesis is that it is possible to make the optimization of robot behavior faster by
collecting evaluations simultaneously generated by multiple users into one common evolutionary
algorithm. Consider an algorithm which attempts to learn preferences supplied by multiple users
based  on  their  evaluations.  If  n  users  simultaneously  indicate  preferences  and  if  their  preferences 
agree,  then  the  machine  learning  algorithm  can  train  on  these  preferences  as  if  they  were  indicated 
by  a  single  user.  Therefore,  it  will  have  up  to  n  times  more  training  data,  which  will  allow  it  to 
build  an  accurate  user  model  faster. 

If user preferences disagree, the algorithm will have to model users separately using their
respective preference sets. In this case the speed of learning of each user model is reduced back
to the level of the single-user case, and the additional computational costs associated with training
multiple user models can impact the performance of the behavior optimization method (see the
Experiments section). However, disagreement in users’ preferences is likely to indicate that more
than one global optimum - or several local optima of similar fitness - have been intuited
by the users and are present in the fitness landscape. In the latter case it is possible to exploit the
disagreement to explore both of the user-favored optima, evaluate them and determine whether one of
them is better than the other in terms of fitness.

To  test  these  suppositions  we  have  developed  an  interactive,  user-modeling  algorithm  which 
can  simultaneously  accept  preferences  from  one  or  two  users.  We  measure  its  performance  with 
two  users  working  together  and  compare  it  to  the  combined  performance  of  two  users  working 
separately,  each  with  her  own  evolutionary  algorithm  and  user  model. 


² Fitness landscapes with such optima are said to be deceptive.




Figure 3: Test problem. (a) Side and top views of the robot and its environment at the beginning
of the simulation. The small square to the left denotes the light source; spheres on the robot’s body
are light sensors. The target position that the robot should reach is depicted with dotted lines. Y_b
denotes the Y coordinate of the barrier. (b) Joint between the robot’s main body (square plate)
and a limb, top view. The dotted line denotes the axis of rotation. The angle of the limb’s rotation
relative to its default position (as in (a)) can take values in [-45°, 45°]. A video of the robot with
a successfully evolved controller can be viewed at http://youtu.be/ByDfAcDBsHI.

3.2  Test  Problem 

We use the test problem from ([0). The goal is to navigate a simple quadrupedal robot around the
wall to a target object on the far side (Fig. 3(a)). The robot is composed of a square plate and four
rigid vertical legs, each attached to the plate by an actuated joint with one degree of freedom
(Fig. 3(b)).

Each body part has one light sensor and one touch sensor. Signals from the photosensors
are real values from [0, 1] varying linearly with their Euclidean distance from the light
source.³ Touch sensors produce 1 if the body part touches the ground or collides with the wall and
-1 otherwise. Additionally, the robot is equipped with a compass sensor which gives the
robot’s current orientation relative to the Y axis, normalized to be in [0, 1].

The  robot  is  controlled  by  a  feedforward  neural  network  without  hidden  nodes.  A  total  of  11 
sensors  connect  to  four  actuators,  which  yields  a  total  of  44  synaptic  weights.  Hereafter  we  will 
refer  to  a  particular  set  of  synaptic  weights  as  a  controller. 
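
The sensor and weight counts above can be checked with a minimal sketch of such a controller. The names `random_controller` and `actuate`, the ordering of the sensor vector, and the tanh squashing of the outputs are assumptions made for illustration; the text specifies only the network topology and, later, the initial weight range [-1, 1].

```python
import math
import random

NUM_SENSORS = 11    # 5 light + 5 touch (one of each per body part) + 1 compass
NUM_ACTUATORS = 4   # one actuated joint per leg


def random_controller(rng):
    """Return an 11 x 4 weight matrix (44 synaptic weights) drawn from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(NUM_ACTUATORS)]
            for _ in range(NUM_SENSORS)]


def actuate(weights, sensors):
    """Map 11 sensor values directly to 4 joint commands (no hidden nodes).

    The tanh activation is an assumption; the text does not name the
    activation function used by the robot's controller.
    """
    if len(sensors) != NUM_SENSORS:
        raise ValueError("expected 11 sensor values")
    return [math.tanh(sum(sensors[i] * weights[i][j] for i in range(NUM_SENSORS)))
            for j in range(NUM_ACTUATORS)]


rng = random.Random(0)
controller = random_controller(rng)
# Hypothetical sensor ordering: light sensors, touch sensors, compass.
sensors = [0.5] * 5 + [1.0, -1.0, 1.0, -1.0, 1.0] + [0.25]
commands = actuate(controller, sensors)
```

Each command lies in [-1, 1] and would still have to be scaled to the joint range of the simulated robot.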

³ The sensors saturate to 0 for distances greater than 40, which is about 5 times further than any robot traveled in
our experiments.
3.3  Methods


The algorithm uses a client-server computational architecture. The client here is an interactive
program which takes a pair of controllers as input, simulates⁴ two copies of the robot with controllers
from the pair and shows the resulting behaviors to the user (Fig. 4). The user is forced to prefer
one - she cannot skip a pair. After the preference is provided, the client sends it to the server.

The  server  performs  the  following  functions: 


•  it  supplies  controllers  to  and  receives  preferences  from  multiple  clients  via  asynchronous 
communication; 

•  it  optimizes  the  robot’s  behavior  with  an  evolutionary  algorithm; 

•  it  generates  the  controller  pairs  to  be  evaluated  by  users  and  maintains  the  users’  preference 
tables; 

•  it  trains  the  user  models  based  on  users’  preferences; 

•  it  employs  predictions  from  the  user  models  along  with  the  fitness  function  to  guide  the 
evolutionary  algorithm. 


A  user  model  is  defined  as  a  mapping  from  a  pair  of  robot  behaviors  to  a  prediction  of  the  user’s 
preference for this pair. The mapping is learned by an artificial neural network⁵ with a hidden layer
using  backpropagation.  For  details,  see  the  User  Models  section  below. 

If  only  one  user  has  supplied  preferences  so  far,  only  one  user  model  is  maintained.  If  two 
users  supply  preferences,  the  program  must  find  an  optimal  way  to  utilize  these.  For  this  purpose 
our  program  maintains  three  separate  user  models  -  one  individual  model  for  each  user  and  one 
collective model, which is trained on the combined preferences of both users. For details see the
Coordinated Score Generation section below.


3.3.1  Evolutionary  Algorithm 

For robot behavior optimization the server uses Age-Fitness Pareto Optimization ([0), an
evolutionary algorithm with two explicit objectives - fitness and age. In all experiments described below
the algorithm starts with a population of 30 controllers, initialized with random synaptic weights
in [-1, 1]. The server simulates controllers sequentially and records the full time series of the
resulting sensor values. When all controllers in the population have been simulated, the algorithm
calculates their fitness values and constructs the Pareto front, taking the time controllers have spent
in the population - their age - into account. The next generation is composed of

•  one new, completely random controller,

•  nondominated controllers from the previous population and

•  their mutated copies, in a quantity sufficient to restore the initial size of the population.
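
The composition of the next generation listed above can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the dictionary representation, the Gaussian mutation with clamping, the increment of survivor ages by one per generation, and the inheritance of the parent's age by mutated copies are all assumptions.

```python
import random

POP_SIZE = 30
NUM_WEIGHTS = 44


def random_individual(rng):
    """A brand-new controller with age 0 and no fitness yet."""
    return {"weights": [rng.uniform(-1.0, 1.0) for _ in range(NUM_WEIGHTS)],
            "fitness": None, "age": 0}


def dominates(a, b):
    """a dominates b if it is no worse in fitness (higher) and age (lower), and better in one."""
    return (a["fitness"] >= b["fitness"] and a["age"] <= b["age"]
            and (a["fitness"] > b["fitness"] or a["age"] < b["age"]))


def pareto_front(population):
    """Controllers that no other controller in the population dominates."""
    return [p for p in population if not any(dominates(q, p) for q in population)]


def mutate(parent, rng, step=0.1):
    """Gaussian perturbation clamped to [-1, 1] (hypothetical mutation operator)."""
    weights = [min(1.0, max(-1.0, w + rng.gauss(0.0, step))) for w in parent["weights"]]
    return {"weights": weights, "fitness": None, "age": parent["age"]}


def next_generation(population, rng):
    """One random newcomer, the nondominated survivors, and mutants to refill the population."""
    survivors = [dict(p, age=p["age"] + 1) for p in pareto_front(population)]
    newcomer = random_individual(rng)
    n_mutants = max(0, POP_SIZE - 1 - len(survivors))
    mutants = [mutate(rng.choice(survivors), rng) for _ in range(n_mutants)]
    return [newcomer] + survivors + mutants


rng = random.Random(1)
population = []
for _ in range(POP_SIZE):
    individual = random_individual(rng)
    individual["fitness"] = rng.random()   # placeholder for a simulated fitness value
    individual["age"] = rng.randrange(5)
    population.append(individual)
new_pop = next_generation(population, rng)
```

The fitness values of the newcomer and the mutants are left unset because they are only known after the next round of simulation.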


⁴ All physics simulations use the Open Dynamics Engine, http://www.ode.org.
⁵ This network is not to be confused with the robot’s controller (see the Test Problem section), which is another
artificial neural network employed in the program. Unlike the one described here, that one has no hidden neurons.
Figure 4: Screenshot of two clients running on the same computer. The user can select a behavior
she  likes  by  cycling  through  the  robots.  The  selected  robot  is  highlighted  and  the  other  one  is 
made  translucent.  The  preference  is  sent  to  the  server  as  soon  as  the  user  confirms  the  selection. 
For  example,  in  the  left  window  the  user  is  about  to  confirm  her  preference  towards  the  highlighted 
robot  to  the  right  of  the  other  contestant. 


The fitness function is

    f = f_u \sigma,    (1)

where f_u is the unscaled fitness (|[3ll):

    f_u = \frac{1}{1 + \left( \sum_{i=1}^{5} \sum_{t=1}^{T} \| s_i^{(t)} - s_i^{(g)} \| \right) / 5T}.    (2)

Here T = 1000 is the number of time steps during which behavior is simulated, s_i^{(t)} is the value of the
i-th light sensor at time step t, and s_i^{(g)} is the value of the i-th light sensor at the goal position (see
Fig. 3).
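
As a worked illustration, the fitness computation can be sketched in a few lines. The sketch assumes that the unscaled fitness is the reciprocal of one plus the mean absolute difference between the recorded light sensor values and their values at the goal position, averaged over the five light sensors and all time steps. The function names and the data layout are hypothetical.

```python
def unscaled_fitness(light_series, goal_light):
    """Reciprocal of 1 plus the mean absolute deviation from the goal readings.

    light_series[t][i] is the reading of light sensor i at time step t;
    goal_light[i] is the reading of light sensor i at the goal position.
    """
    num_steps = len(light_series)
    num_sensors = len(goal_light)
    total = sum(abs(row[i] - goal_light[i])
                for row in light_series for i in range(num_sensors))
    return 1.0 / (1.0 + total / (num_sensors * num_steps))


def fitness(light_series, goal_light, coordinated_score=0.5):
    """Unscaled fitness multiplied by the coordinated score (0.5 before any preferences arrive)."""
    return unscaled_fitness(light_series, goal_light) * coordinated_score


goal = [0.2, 0.4, 0.6, 0.8, 1.0]
at_goal = [list(goal) for _ in range(1000)]   # robot sits at the goal for all T = 1000 steps
far_away = [[1.0] * 5 for _ in range(10)]     # every reading is off by exactly 1
```

A robot whose light sensors match the goal readings throughout the simulation gets an unscaled fitness of 1, and larger deviations push the value toward 0.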

σ is the coordinated score: a number in [0, 1] which represents a combined prediction from all
of the user models about how much the user (or users) would like this controller. In particular,
σ near 1 indicates that at least one of the two user models tended to prefer this controller when
it was presented multiple times, while a score near 0 indicates that the user models predict that
both users will greatly dislike this controller. At the beginning of the program’s operation, when
no user preferences have been provided yet, it is equal to 0.5 for all controllers. For details on σ
see the Coordinated Score Generation section below.

In  the  current  implementation,  the  second  generation  commences  only  after  the  first  pair  of 
controllers has been evaluated by a user. This ensures that the coordinated score σ affects evolution
from  the  outset.  However,  in  practice,  this  should  have  little  impact  on  evolution,  because  the  user 
models  learn  more  slowly  than  the  evolutionary  algorithm  improves  the  robot’s  behavior:  it  takes 
many generations before the user models’ predictions deviate significantly from 0.5.

3.3.2  User  Preference  Gathering 

After  evaluating  the  first  generation,  the  server  ranks  the  controllers  from  the  Pareto  front  by  fitness 
and  requests  the  evaluation  of  the  four  best  controllers  from  the  users.  The  first  user  must  compare 
the  first  and  the  second  controller,  and  the  second  user  compares  the  third  controller  to  the  fourth. 
The  program  waits  for  either  user  to  evaluate  her  pair  and  then  enters  the  evolutionary  loop  of 
reproduction  and  selection  (see  the  section  above).  The  server  never  pauses  to  wait  for  any  user 
action  after  the  indication  of  this  first  preference. 
Every time the program evaluates all unscaled fitness values f_u of the controllers from the
current generation, it checks whether any of its previous requests for user preferences were granted. If
that is not the case, the program continues with the next iteration of the evolutionary loop.
Otherwise, it stores the obtained preference into a table of preferences, selects a new pair of controllers
for user evaluation, sends it to the client and, if appropriate, retrains some user models on the
expanded set of preferences.

All controllers sent to the client for user evaluation are stored, along with their respective sensor
time series, in an archive. The obtained user preferences are stored in the preference table P, such
that P[i,j] = 1 if the i-th controller of the archive was preferred to the j-th, -1 if the j-th controller
was preferred to the i-th, and 0 if the preference is neutral (in the current implementation that is
possible only for P[i,i]) or not yet known (fl3]|, iflTl).
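
A minimal sketch of such a preference table is given below. The class and method names are hypothetical; only the encoding of the entries (1, -1 and 0) follows the description above.

```python
class PreferenceTable:
    """Square table P with P[i][j] = 1 if controller i was preferred to j,
    -1 if j was preferred to i, and 0 if neutral or not yet known."""

    def __init__(self):
        self.P = []

    def add_controller(self):
        """Append a controller to the archive and return its index."""
        for row in self.P:
            row.append(0)                      # new column: nothing known yet
        self.P.append([0] * (len(self.P) + 1))
        return len(self.P) - 1

    def record(self, winner, loser):
        """Store one user preference, keeping the table antisymmetric."""
        if winner == loser:
            raise ValueError("a controller cannot be compared with itself")
        self.P[winner][loser] = 1
        self.P[loser][winner] = -1

    def is_known(self, i, j):
        """True if the preference between two distinct controllers has been recorded."""
        return i != j and self.P[i][j] != 0


table = PreferenceTable()
a = table.add_controller()
b = table.add_controller()
c = table.add_controller()
table.record(a, b)   # the user preferred controller a to controller b
```

Because only the diagonal is neutral in the current implementation, a zero off the diagonal can be read as "not yet known".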

To accelerate the filling of the preference table we assume that user preferences are transitive.
Consider a situation when the user has seen n controllers c_1, c_2, ..., c_n so far, and for every i < j
she preferred c_j to c_i. The program then assumes that if a new controller d is preferred over c_j
for some j < n, then all controllers c_i (for which i < j) are not preferred over d.
Similarly, if d is not preferred over c_j, then all controllers from the upper part of the ranking, c_i
(with i > j), are assumed to be preferred over d.

To determine how a new controller fares against controllers previously shown to a user, the
program uses a version of binary search adapted for our purposes. First, for each controller already
shown it produces a score: the number of times this particular controller was preferred to its peers
minus the number of times it was not preferred. If a new controller d is preferred over some
previously shown controller c_i, then it is assumed that d is preferred over all previously shown
controllers with a score less than or equal to the score of c_i, and the corresponding entries of
P[i,j] are stored. Similarly, if some previously shown controller c_i is preferred to the new one, the
algorithm assumes that all controllers with a score higher than that of c_i are preferred to the new
controller.

The  old  controllers  are  shown  to  the  user  (paired  with  the  new  controller)  in  the  following 
order:  the  controller  with  the  highest  score  is  shown  first,  then  the  one  with  the  lowest  score  and 
then  -  repeatedly  -  the  closest  one  to  the  middle  of  the  current  interval  of  possible  values  of  the 
score  for  the  new  controller.  The  algorithm  terminates  when  all  of  the  relationships  between  the 
new  controller  and  the  previously  known  ones  are  established.  In  the  worst  case  this  happens  after 
the user has indicated 2 + log_2 n preferences; in the best case one preference is sufficient.
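
The query order described above can be sketched as follows. This is a simplified illustration under stated assumptions: previously shown controllers are held in a list sorted from lowest to highest score with no ties, the middle element is chosen by rank index rather than by score value, and the user's answers are consistent with a single ranking. The function name and the `prefers_new` callback are hypothetical.

```python
def place_new_controller(ranked, prefers_new):
    """Locate a new controller among previously ranked ones.

    ranked: previously shown controllers, lowest score first.
    prefers_new(c): True if the user prefers the new controller over c.
    Returns (position, queries): by transitivity the new controller is taken
    to be preferred over ranked[:position] and not over ranked[position:].
    """
    n = len(ranked)
    if n == 0:
        return 0, 0
    queries = 1
    if prefers_new(ranked[-1]):       # beats the best one: beats all of them
        return n, queries
    if n == 1:
        return 0, queries
    queries += 1
    if not prefers_new(ranked[0]):    # loses to the worst one: loses to all
        return 0, queries
    lo, hi = 1, n - 1                 # answer now lies strictly inside the ranking
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if prefers_new(ranked[mid]):
            lo = mid + 1
        else:
            hi = mid
    return lo, queries


# Stand-in for a user: controllers are numbers and larger is always preferred.
old = [10, 20, 30, 40, 50]
inside = place_new_controller(old, lambda c: 35 > c)
best = place_new_controller(old, lambda c: 60 > c)
worst = place_new_controller(old, lambda c: 5 > c)
```

In this example the new controller that beats every old one is placed after a single query, while the one that falls inside the ranking needs four.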

When  the  binary  search  described  above  terminates,  two  events  occur.  First,  a  new  controller  is 
selected  among  the  current  evolutionary  population  to  be  evaluated  by  the  user.  In  the  experiments 
described  here,  the  algorithm  selected  the  most  fit  controller  among  those  which  have  not  been  seen 
by  any  user  yet.  The  server  sends  the  pair,  as  dictated  by  the  first  step  of  the  bisection  algorithm 
described  above,  to  the  user. 

Second, two user models are retrained: the individual model corresponding to the author of the
newly gathered preference and the collective user model. The models are trained on the fully
evaluated subsets of users’ archives, i.e. on those subsets for which the preference is known for each
pair  of  controllers  in  the  subset.  The  individual  model  is  retrained  on  the  preference  table  of  the 
user  who  indicated  the  last  preference;  the  collective  model  uses  the  tables  of  both  users. 

This process of robot behavior optimization, preference gathering, and user modeling is
repeated indefinitely, or until the server process is terminated.

Note that with the pair selection strategy described above a user never gets to evaluate a
controller which has already been seen by her peer. The motivation for this is twofold. First, this
approach maximizes the diversity of controllers available to the collective user model, which in
turn maximizes its potential for accurate prediction. Second, it facilitates the detection of
situations when the learned user models tend to overfit the user data (see the [33] section).


3.3.3  User  Models 


A  user  model  is  a  mapping  between  the  robot’s  behavior  and  an  assessment  of  its  quality  by  the 
user.  In  this  particular  algorithm  we  employ  a  mapping  which  takes  as  input  two  robot  behaviors 
compressed into feature vectors and maps them onto a value from [-1, 1], approximating the
corresponding entry of the preference table P (see the User Preference Gathering section).

In  all  experiments  described  here  we  use  the  values  from  six  sensors  (five  light  sensors  and  one 
compass  sensor)  of  the  robot  recorded  at  the  middle  (t  =  T/ 2)  of  the  simulation  as  the  feature 
vector  ([]3l).  This  kind  of  compressed  representation  simplifies  the  problem  of  learning  the  user 
model.  Designing  a  general  way  in  which  such  a  vector  can  be  generated  to  facilitate  learning  is  a 
nontrivial  problem  and  it  is  not  considered  in  this  work. 

The  mapping  is  learned  by  an  artificial  neural  network  with  12  inputs  -  six  for  each  feature 
vector  of  the  two  controllers  which  the  model  is  supposed  to  compare.  These  neurons  are  connected 
to  the  only  output  of  the  network  through  a  single  hidden  layer  containing  12  neurons. 

For convenience, the output neuron is trained to reproduce not P[i,j] itself, but its linear
transformation to [0, 1]:

    target(i, j) = \frac{P[i,j] + 1}{2}.    (3)


The  network  is  trained  using  error  backpropagation  ([HI,  fl3j|).  The  algorithm  iterates  through 
all  entries  of  the  preference  table  P[i,j]  and  backpropagates  the  network’s  errors  associated  with 
each  entry  once.  If  the  network  being  trained  is  the  collective  user  model,  the  same  procedure 
is  applied  to  the  other  user’s  preference  table  as  well.  Then  it  iterates  through  all  of  the  entries 
again  and  compares  the  sign  of  the  model’s  prediction  to  the  sign  of  the  original  entry.  If  the 
signs  coincide  for  all  entries,  the  network  is  considered  to  be  successfully  trained.  Otherwise,  the 
procedure is repeated, but no more than 10^4 (m/2 - n) times, where m is the total number of table
entries  and  n  is  the  total  number  of  controllers.  If  this  number  is  reached,  the  learning  process  is 
considered  to  have  failed. 
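
The stopping rule of this procedure can be sketched as follows. The two callbacks are hypothetical stand-ins for the actual backpropagation update and for the network prediction mapped back to the signed scale of P. The sketch also assumes that m is counted over all tables being learned and that the pass limit is rounded down to an integer; the text does not state either point.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def train_user_model(tables, n_controllers, backprop_entry, predict_signed,
                     scale=10**4):
    """Repeat training passes until the sign of every entry is reproduced.

    tables         -- one dict per user, mapping (i, j) -> P[i, j]
    backprop_entry -- hypothetical callback: one weight update for one entry
    predict_signed -- hypothetical callback: prediction on the signed scale of P
    Returns True on success and False once the pass budget is exhausted.
    """
    entries = [(i, j, p) for table in tables for (i, j), p in table.items()]
    m = len(entries)                      # total number of table entries
    max_passes = int(scale * (m / 2 - n_controllers))
    for _ in range(max_passes):
        for i, j, p in entries:           # backpropagate each entry once
            backprop_entry(i, j, p)
        # second sweep: compare prediction signs with the original entries
        if all(sign(predict_signed(i, j)) == sign(p) for i, j, p in entries):
            return True
    return False
```

Passing two tables in `tables` corresponds to training the collective user model; passing one corresponds to an individual model.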

Depending  on  the  outcome  of  the  learning  procedure,  the  algorithm  assigns  model  errors  to 
each  generated  user  model  as  follows: 


•  10  if  the  learning  failed; 


•  2  if  the  learning  was  attempted  on  one  preference  table  and  succeeded; 


•  1  if  the  learning  was  attempted  on  two  preference  tables  and  succeeded. 

This value is used to determine the best way to utilize the three user models. As we will see in the next section, the behavior of the algorithm we use to accomplish that does not depend on the particular values we chose to represent the models’ errors, but only on the relative order of these values on the real axis.
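
The error assignment listed above amounts to a small lookup, sketched here with a hypothetical function name:

```python
def model_error(succeeded, n_tables):
    """Model error for a user model trained on n_tables preference tables."""
    if not succeeded:
        return 10                      # learning failed
    return 2 if n_tables == 1 else 1   # individual model: 2, collective model: 1
```

With these values a successfully trained collective model always has the lowest error, and a failed model always has the highest.
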

3.3.4 Coordinated Score Generation


To generate coordinated scores $\sigma$ (see the Evolutionary Algorithm section above) for the newly evolved controllers, the server starts by producing scores based on each one of the user models present (OJ).

To  determine  these  scores  for  an  evolutionary  population  of  size  30,  each  user  model  fills  a 
30  x  30  table  V[i,j]  with  its  preference  approximations  (from  [0, 1]).  The  score  is  then  calculated 
as 


2  30 

°k(j)  = 


(4) 


where $k \in \{0, 1, C\}$ is the index of the user model: 0 and 1 correspond to the first and second individual user models respectively, and C corresponds to the collective model^6.

Denoting the errors of the models (defined in the previous section) as $e_0$, $e_1$ and $e_C$, the coordinated score $\sigma$ can be computed as follows:


1. If there is only one user model, use its score;

2. If $e_C < e_0$ and $e_C < e_1$, use $\sigma_C$;

3. Otherwise, use $\max(\sigma_0, \sigma_1)$.


The first rule describes the trivial behavior the algorithm exhibits when only one user supplies preferences. The second corresponds to the condition under which the score from the collective user model should be used. With the model errors defined as in the previous section, this rule applies whenever the backpropagation algorithm was able to train the collective user model successfully. The basis for this decision is the assumption that if it is possible to successfully train the collective user model on the data provided by two independent users, then these users are likely to be “allied”, i.e., they are guiding the evolutionary search towards the same optimum.

The  third  rule  describes  the  case  when  users  are  likely  to  have  different  opinions  regarding 
which  optimum  is  a  global  one.  In  that  case  the  max  function  helps  to  retain  controllers  which 
are  favored  by  one  of  the  two  users.  This  allows  us  to  take  both  users’  opinions  into  account  and 
subject  behaviors  favored  by  each  one  of  them  to  direct  competition  in  the  evolutionary  algorithm. 

We  do  not  consider  users  who  make  errors  or  change  their  opinion  over  time  in  this  work. 
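
The three rules for selecting the coordinated score can be written as a short function. The dictionary layout (keys 0, 1 and "C") and the function name are assumptions made for this sketch only.

```python
def coordinated_score(scores, errors):
    """Select the coordinated score for one controller.

    scores -- dict mapping model index (0, 1, "C") to that model's score
    errors -- dict mapping model index (0, 1, "C") to that model's error
    """
    if len(scores) == 1:                       # rule 1: a single user model
        return next(iter(scores.values()))
    if errors["C"] < errors[0] and errors["C"] < errors[1]:
        return scores["C"]                     # rule 2: collective model wins
    return max(scores[0], scores[1])           # rule 3: keep either user's favorite
```

For example, with scores `{0: 0.2, 1: 0.7, "C": 0.4}` the function returns 0.4 when the errors are `{0: 2, 1: 2, "C": 1}` and 0.7 when they are `{0: 2, 1: 2, "C": 10}`. Only the ordering of the error values matters, not their magnitudes.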


3.4  Experiments 


To reduce the amount of effort required to test the algorithm and to increase the experiments’ reproducibility, we employed surrogate users in place of humans (Q). A surrogate user is a version of the client program which simulates the behavior of a human user with particular preferences. In our experiments surrogate users preferred robots that attempt to go around the right edge of the barrier. The surrogate user detects this by determining which one of the two controllers yields the larger X coordinate for the robot’s position at the mid-point of the simulation (t = T/2); we henceforth refer to this as a surrogate user preferring the “rightmost” behavior. For simulating users with different strategies, we also made a version of the surrogate user who prefers behaviors with the lowest X coordinate at the same point in the evaluation period (i.e., a user who prefers “leftmost” behaviors).

6 A similar metric was defined in the User Preference Gathering section to rank controllers by the degree to which a user likes or dislikes them. The value we generate here serves a similar purpose, but is computed using a different set of controllers.

Figure 5: Time series for some parameters of the server over the course of typical allied simulations: (a) an unsuccessful simulation and (b) a successful simulation. The three topmost graphs represent the unscaled fitness $f_u$, age and the coordinated score $\sigma$ of the current best controller in the evolutionary population (computed using the product $f_u \sigma$). The red dotted line in the fitness graph (top) shows a rough estimate of the maximal value of $f_u$ which a robot not going around the barrier can have (0.88). The fourth graph from the top shows how the way in which the server modeled users changed over the course of the simulation: “coll”, “u0” and “u1” indicate the usage of the collective user model and the individual models of the first and the second user, respectively; “indiv” corresponds to the two users being modeled separately. See the Coordinated Score Generation section for details. The graph at the bottom gives the logical value “No robots above the barrier”: false if there are any controllers in the current population which make the robot travel beyond the barrier (i.e., have $Y > Y_b$ at some point of the behavior simulation) and true otherwise.

In addition, the surrogate user stopped supplying preferences and terminated the client if it encountered a controller which is able to guide the robot around the barrier, i.e., one whose trajectory contains points with Y greater than the coordinate $Y_b$ at which the barrier is located (see Fig. [3]).
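
The surrogate user's decision rule and its termination test can be sketched as below. The trajectory representation (a list of (x, y) positions sampled uniformly over the evaluation period) and the tie-breaking towards the first controller are assumptions of this sketch, not details given in the text.

```python
def preferred(traj_a, traj_b, mode="rightmost"):
    """Return 0 if the first behavior is preferred, 1 if the second one is.

    Each trajectory is a list of (x, y) robot positions; the element in the
    middle of the list stands in for the position at t = T/2.
    """
    x_a = traj_a[len(traj_a) // 2][0]
    x_b = traj_b[len(traj_b) // 2][0]
    if mode == "rightmost":
        return 0 if x_a >= x_b else 1   # larger X at mid-point wins
    return 0 if x_a <= x_b else 1       # "leftmost": smaller X wins


def goes_around_barrier(trajectory, y_b):
    """True if any point of the trajectory lies beyond the barrier at y_b."""
    return any(y > y_b for _, y in trajectory)
```

A surrogate client would call `preferred` on each presented pair and stop supplying preferences once `goes_around_barrier` returns True for a controller it has seen.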

3.4.1  Results 

In  the  simulations  discussed  below  the  server  was  run  for  30  minutes  of  wall  clock  time.  One  or 
two  clients  controlled  by  the  surrogate  users  were  run  on  the  same  computer  as  parallel  processes 
(Fig.  [4]).  Once  every  60  seconds  the  clients  supplied  preferences  to  the  server. 

Three types of simulation were performed:

•  single user simulations with one surrogate user preferring “rightmost” behaviors;

•  allied simulations with two surrogate users preferring “rightmost” behaviors;

•  opposing simulations with one surrogate user preferring “rightmost” behaviors and one surrogate user preferring “leftmost” behaviors.

| Simulation type | # of wins / # of runs | Rate of success | # of generations per run (average ± std. deviation) | # of generations spent using the collective model / total # of generations |
|---|---|---|---|---|
| Single user | 53/300 | 0.177 | $(3.12 \pm 0.13) \times 10^{2}$ | 0/937443 (0%) |
| Allied users | 86/300 | 0.287 | $(2.59 \pm 0.28) \times 10^{2}$ | 538610/777250 (69%) |
| Opposing users | 25/300 | 0.083 | $(2.48 \pm 0.20) \times 10^{2}$ | 397220/744647 (53%) |

Table 1: Experimental results

We  considered  a  simulation  to  have  succeeded  if  during  the  last  20  generations  it  had  at  least 
one  controller  in  the  server’s  evolutionary  population  which  was  able  to  guide  the  robot  around  the 
barrier  (defined  as  above). 

Figure [5] demonstrates how some parameters of the server change over the course of two typical allied simulations. Periodically, the age of the fittest controller stays constant for short periods of time (“plateaus” on the age graphs). This occurs when the server spends a significant portion of its time on model training, and it indicates that training the user models carries a substantial computational overhead.

The graphs for the opposing simulations are very similar. Graphs for single user simulations differ from Figure [5] in two respects. First, there is no switching between individual and collective user models to guide evolution: the algorithm has only one user model, and it is the only one which is ever used. Second, the amount of time the server spends training the user model is substantially lower.

We performed 300 runs of each simulation type with the servers configured as described above. The results are presented in Table [1].

We used the one-tailed Z-test to compare the success rates (©). The rate of success for allied simulations was found to be significantly higher than the success rate of the single user simulations ($p < 7 \times 10^{-4}$), despite the significantly lower ($p < 10^{-4}$ by the standard t-test) average number of evolutionary generations per run.

The success rate for the opposing simulations was found to be significantly lower ($p < 7 \times 10^{-4}$) than the success rate of the single user simulations. The average number of generations per run is about the same as for the allied simulations, and is significantly less than the number of generations for the single user simulations ($p < 10^{-4}$).
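
The success-rate comparisons can be checked from the win counts in Table 1 with a two-proportion Z-test. The report does not say which form of the standard error was used; this sketch assumes the unpooled form, and a pooled estimate would give a slightly smaller z for the same counts.

```python
import math


def one_tailed_z_test(wins_high, wins_low, runs):
    """Test whether the first success rate exceeds the second one.

    Both samples are assumed to contain the same number of runs.
    Returns the z statistic and the one-tailed p-value.
    """
    p_high = wins_high / runs
    p_low = wins_low / runs
    # unpooled standard error of the difference of two proportions
    se = math.sqrt(p_high * (1 - p_high) / runs + p_low * (1 - p_low) / runs)
    z = (p_high - p_low) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # upper tail of N(0, 1)
    return z, p_value


# allied (86/300) versus single user (53/300)
z_allied, p_allied = one_tailed_z_test(86, 53, 300)
# single user (53/300) versus opposing (25/300)
z_opposing, p_opposing = one_tailed_z_test(53, 25, 300)
```

Under this assumption both p-values fall below the $7 \times 10^{-4}$ bound quoted in the text.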

The ratio between the number of generations which the server spent using the collective model and the total number of generations was found to be significantly higher ($p < 10^{-5}$) in the allied simulations than in the opposing simulations.

Out  of  all  25  opposing  simulations  which  succeeded,  at  least  10  did  so  by  taking  the  robot 
around  the  right  side  of  the  barrier  and  at  least  9  used  the  left  side  of  the  barrier. 

3.5 Discussion


If a simulation involves two users, on average it iterates through fewer generations of the evolutionary algorithm than a simulation with only one user. We explain this as follows: in single user simulations the server maintains only one user model, which reduces the computational expense required for model training compared to the two user case, in which three user models must be continually trained and re-trained. The evolutionary algorithm exploits this reduced computational burden by performing more generations.

The  results  also  indicate  that  in  allied  simulations  the  program  performs  better  than  in  the 
single  user  simulations.  This  confirms  our  hypothesis  that  it  is  possible  to  accelerate  robot  behavior 
optimization  by  utilizing  preferences  from  multiple  users,  despite  the  additional  cost  incurred  by 
having  to  train  models  of  both  individual  and  collective  user  behavior. 

We hypothesize that the inferior performance of the program when it hosts opposing users is due to the following three factors:

1.  When  the  coordinated  score  is  generated  as  a  maximum  of  scores  by  the  individual  user 
models,  the  evolutionary  population  is  effectively  divided  into  two  subpopulations,  each  of 
which  consists  of  controllers  favored  by  the  corresponding  individual  user  model.  This  leads 
to  a  growth  of  the  Pareto  front  and  ultimately  slows  down  search.  We  hypothesize  that  this 
problem  may  be  remedied  by  utilizing  an  evolutionary  algorithm  which  treats  the  Pareto 
front  in  a  different  way  and/or  employs  a  larger  population. 

2.  In  the  experiments  presented  here,  during  a  substantial  fraction  of  generations  (53%)  op¬ 
posing  simulations  employed  the  collective  user  model  to  guide  search.  The  collective  user 
model  “successfully”  learned  a  data  set  which  has  implicit  internal  inconsistencies.  That  is, 
the  model  must  learn  to  take  two  similar  inputs  yet  output  two  very  different  predictions:  for 
example,  the  first  user  very  much  liked  the  first  behavior,  but  the  second  user  greatly  disliked 
the  second,  similar  behavior.  This  suggests  that  the  model  has  overfit  the  data  and  its  usage 
can  negatively  impact  the  algorithm’s  search  ability. 

Notice that if there were at least two controllers which both users had seen, it would become impossible to “successfully” train the collective user model. Such overlap can conceal the problem of overfitting. For this reason, although it would help the algorithm to recognize the situation when it is better to model users separately, overlap is not allowed in the experiments reported here. This was one of the reasons why we decided to query the users on completely disjoint sets of controllers (see the User Preference Gathering section).


This  problem  might  be  solved  by  using  a  different  metric  for  the  user  model’s  learning  effi¬ 
ciency. 

3.  The  additional  computational  overhead  mentioned  above  in  the  context  of  allied  simulations 
already  places  this  simulation  at  a  computational  disadvantage  compared  to  the  single  user 
simulation. 


Despite  the  general  failure  to  accommodate  opposing  users,  the  algorithm  still  managed  to 
solve  the  task  in  a  substantial  fraction  of  runs.  These  runs  succeeded  by  discovering  both  user- 
favored  optima,  which  indirectly  confirms  our  second  hypothesis  about  the  possibility  of  finding 
and  comparing  multiple  user-favored  optima  in  the  fitness  landscape. 

3.6 Conclusions


Our  findings  confirm  that  in  robot  behavior  optimization  tasks  it  is  possible  to  increase  the  perfor¬ 
mance  of  fitness-based,  user-assisted  evolutionary  algorithms  by  utilizing  preferences  from  mul¬ 
tiple  users.  This  constitutes  a  step  towards  fitness-based,  crowd-assisted  algorithms  which  may 
potentially  solve  problems  too  deceptive  to  be  solved  by  purely  automated  algorithms. 

We  demonstrated  that  employing  more  than  one  user  can  help  solve  robot  behavior  optimization 
tasks  in  at  least  two  ways.  First,  if  users  approach  the  task  with  the  same  strategy,  this  approach 
allows  the  optimizer  to  recognize  and  employ  the  strategy  more  rapidly.  Second,  if  the  users 
employ  different  strategies,  it  is  possible  to  find  all  optima  recognized  by  the  users  and  choose  the 
best  one  among  them. 

However,  the  task  of  designing  such  algorithms  is  far  from  trivial.  Here  we  would  like  to 
highlight  some  difficulties  particular  to  search  algorithms  guided  by  multiple  users,  which  employ 
user  modeling.  A  good  algorithm  of  this  type  must 

1.  be  able  to  distinguish  between  different  user  strategies  and  model  each  appropriately; 

2.  employ  user  modeling  algorithms  flexible  enough  to  adapt  to  any  or  almost  any  benign  user 
strategy,  yet  not  overfit  user  input  and  thus  retain  good  extrapolation  properties;  and 

3.  employ  a  search  algorithm  which  is  able  to  retain  good  performance  while  utilizing  user 
models  that  change  in  number  and  quality. 

Every  one  of  these  tasks  constitutes  a  nontrivial  design  problem  in  its  own  right.  However,  we 
believe  that  all  of  these  challenges  can  be  addressed  by  a  suitable  combination  of  machine  learning 
techniques.  Possible  future  work  may  include  utilizing  clustering  to  solve  the  first  subproblem 
listed  above  (pioneered  in  (0)),  evolving  user  models  of  varying  accuracy  and  complexity  to 
address  the  second  one  and  designing  evolutionary  algorithms  with  better  scaling  properties  to 
handle  the  last  one. 